Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)
The data was collected in 2019 from Turkish students at two faculties: the Faculty of Engineering and the Faculty of Educational Sciences.
The goal is to create an ML model that predicts student performance from the survey data.
The grades are categorical (AA, BA, BB, CB, CC, DC, DD, and Fail), so the task should be modelled as multi-class classification.
Data Set Information
The data contains results from a survey: columns 1-10 relate to personal questions, columns 11-16 are family-related, and the remaining columns cover education habits.
The outcome variable (the grades) shows an imbalanced distribution. While DD accounts for 25% of the data, BA and CB each make up less than 10%, and Fail represents only 5.5% of the whole data set (just eight points). This will present a problem: the model will have very few data points from which to learn to predict the Fail grade, but many more from which to learn the DD grade.
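A quick way to surface this imbalance is to inspect the relative class frequencies before modelling. The counts below are a hypothetical reconstruction chosen only to match the percentages quoted above (the real notebook would load the survey file instead):

```python
import pandas as pd

# Hypothetical grade column; counts chosen to mirror the quoted
# percentages (DD at 25%, Fail at 5.5% with eight points, 145 rows total).
grades = pd.Series(
    ["DD"] * 36 + ["CC"] * 22 + ["DC"] * 20 + ["AA"] * 18
    + ["BB"] * 17 + ["BA"] * 13 + ["CB"] * 11 + ["Fail"] * 8,
    name="GRADE",
)

# Relative class frequencies reveal the imbalance at a glance.
distribution = grades.value_counts(normalize=True).round(3)
print(distribution)
```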
Error, Data Transformation, K-Fold and Metrics
Precision and recall provide insights into the model's performance for each class individually, while accuracy gives an overall view of the model's correctness. Since this is a multi-class classification problem, precision and recall are calculated individually for each class and then averaged.
Precision: measures the proportion of correctly predicted grades out of all grades predicted as a specific grade. In this case, when predicting an AA grade, what proportion of all predicted AA grades were truly AA grades. The procedure is repeated for each individual grade. High precision indicates that the model is good at identifying a specific grade without confusing it with the other grades. However, it does not account for the actual grades the model failed to predict (false negatives).
Recall: measures the proportion of correctly predicted grades out of all actual grades in the set. In this case, when predicting an AA grade, what proportion of all true AA grades were predicted as AA. The procedure is repeated for each individual grade. High recall indicates that the model is good at assigning most of the grades in each category to their real category.
Accuracy: measures the overall correctness of the model's predictions across all grades. It calculates the proportion of correctly predicted grades out of the total number of grades, giving a single overall assessment of performance across every grade category. However, it may not be the most informative metric for imbalanced datasets, where the number of instances in each class varies significantly.
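As a concrete sketch of how these three metrics are computed with scikit-learn, using `average='weighted'` so each class's score is weighted by its support (the toy labels below are invented purely for illustration):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy true/predicted grades over the grade labels (illustrative only).
y_true = ["AA", "BA", "DD", "DD", "CC", "Fail", "DC", "DD"]
y_pred = ["AA", "BB", "DD", "DC", "CC", "DD", "DC", "DD"]

# 'weighted' averaging computes the per-class score, then averages
# weighted by each class's number of true instances.
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="weighted", zero_division=0)
rec = recall_score(y_true, y_pred, average="weighted", zero_division=0)
print(acc, prec, rec)
```

Note that weighted recall always equals accuracy in the multi-class single-label setting, which is why those two columns often match in the reports below.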
Train and Test Subsets
Since the data is imbalanced, the imbalance has to be taken into account when splitting into train and test sets. A stratified split is needed so that the model can train on all possible outcomes, with a class distribution comparable to the one expected in unseen data.
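A minimal sketch of such a stratified split, using synthetic stand-in data since the survey features are not shown here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced stand-in for the survey data.
X, y = make_classification(
    n_samples=145, n_classes=4, n_informative=8,
    weights=[0.4, 0.3, 0.2, 0.1], random_state=6064,
)

# stratify=y keeps the class proportions of the full data set
# (nearly) identical in both the train and the test subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=6064,
)
print(len(y_train), len(y_test))
```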
The following list shows the estimators –and their parameters– that are studied to identify the best possible model:
```python
from sklearn.ensemble import (
    AdaBoostClassifier, ExtraTreesClassifier,
    GradientBoostingClassifier, RandomForestClassifier,
)
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

log = LogisticRegression(penalty=None, random_state=6064, solver='saga', max_iter=7500, multi_class='multinomial', n_jobs=-1)
l1 = LogisticRegression(penalty='l1', random_state=6064, solver='saga', max_iter=7500, multi_class='multinomial', n_jobs=-1)
l2 = LogisticRegression(penalty='l2', random_state=6064, solver='sag', max_iter=10500, multi_class='multinomial', n_jobs=-1)
net = LogisticRegression(penalty='elasticnet', random_state=6064, solver='saga', max_iter=10500, multi_class='multinomial', n_jobs=-1, l1_ratio=0.5)
sgd = SGDClassifier(loss='modified_huber', penalty=None, max_iter=7500, n_jobs=-1, random_state=6064)
mlp = MLPClassifier(solver='adam', max_iter=4500, random_state=6064)
dtc = DecisionTreeClassifier(random_state=6064)
rfc = RandomForestClassifier(random_state=6064, n_jobs=1)
etc = ExtraTreeClassifier(random_state=6064)
ets = ExtraTreesClassifier(random_state=6064, n_jobs=1)
abc = AdaBoostClassifier(random_state=6064)
gpc = GaussianProcessClassifier(kernel=RBF(0.05), random_state=6064, n_jobs=1)
gbc = GradientBoostingClassifier(loss='log_loss', random_state=6064)
svc = SVC(kernel=RBF(), probability=True)
```
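The tables that follow report, per estimator, the same four metrics. A hedged sketch of how such a comparison could be produced with `cross_validate` is shown below; the synthetic data and the two estimators chosen are placeholders for brevity, not the notebook's actual evaluation loop:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the survey features and grade labels.
X, y = make_classification(
    n_samples=145, n_classes=4, n_informative=8, random_state=6064,
)

# The same four metrics that appear as columns in the result tables.
scoring = {
    "accuracy": "accuracy",
    "recall_weighted": "recall_weighted",
    "precision_weighted": "precision_weighted",
    "auc": "roc_auc_ovr_weighted",
}

models = {
    "randomforest": RandomForestClassifier(random_state=6064),
    "extratrees": ExtraTreesClassifier(random_state=6064),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validation; each test_* array holds one score per fold.
    results[name] = cross_validate(model, X, y, scoring=scoring, cv=5)
    print(name, round(results[name]["test_accuracy"].mean(), 3))
```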
| Model | Accuracy | Recall weighted | Precision weighted | AUC |
|---|---|---|---|---|
| logisticregression | 0.250000 | 0.229200 | 0.188900 | 0.625700 |
| logisticregression_l1 | 0.333300 | 0.291700 | 0.223600 | 0.714700 |
| logisticregression_l2 | 0.250000 | 0.194400 | 0.156500 | 0.644300 |
| logisticregression_elasticnet | 0.305600 | 0.263900 | 0.210800 | 0.683800 |
| sgd | 0.277800 | 0.250000 | 0.193500 | 0.551800 |
| mlp | 0.361100 | 0.312500 | 0.241300 | 0.635900 |
| decisiontree | 0.166700 | 0.131900 | 0.103500 | 0.503500 |
| randomforest | 0.416700 | 0.347200 | 0.291700 | 0.695700 |
| extratree | 0.138900 | 0.152800 | 0.121500 | 0.512600 |
| adaboost | 0.250000 | 0.159700 | 0.084600 | 0.607700 |
| extratrees | 0.416700 | 0.375000 | 0.343100 | 0.723000 |
| gaussianprocess | 0.083300 | 0.125000 | 0.010400 | 0.500000 |
| gradientboosting | 0.388900 | 0.395800 | 0.309000 | 0.661400 |
| svc | 0.250000 | 0.125000 | 0.031200 | 0.326000 |
| Model | Accuracy | Recall weighted | Precision weighted | AUC |
|---|---|---|---|---|
| logisticregression | 0.277800 | 0.243100 | 0.214600 | 0.632400 |
| logisticregression_l1 | 0.333300 | 0.284700 | 0.229900 | 0.716900 |
| logisticregression_l2 | 0.277800 | 0.222200 | 0.199300 | 0.637900 |
| logisticregression_elasticnet | 0.277800 | 0.201400 | 0.165300 | 0.673600 |
| sgd | 0.250000 | 0.180600 | 0.194400 | 0.572400 |
| mlp | 0.250000 | 0.250000 | 0.156200 | 0.600300 |
| decisiontree | 0.250000 | 0.284700 | 0.211800 | 0.586200 |
| randomforest | 0.250000 | 0.222200 | 0.160100 | 0.660200 |
| extratree | 0.194400 | 0.138900 | 0.116000 | 0.510100 |
| adaboost | 0.194400 | 0.201400 | 0.168500 | 0.605000 |
| extratrees | 0.250000 | 0.208300 | 0.194400 | 0.706400 |
| gaussianprocess | 0.083300 | 0.125000 | 0.010400 | 0.500000 |
| gradientboosting | 0.305600 | 0.291700 | 0.215300 | 0.661900 |
| svc | 0.250000 | 0.125000 | 0.031200 | 0.333800 |
| Model | Accuracy | Recall weighted | Precision weighted | AUC |
|---|---|---|---|---|
| logisticregression | 0.416700 | 0.354200 | 0.354200 | 0.700400 |
| logisticregression_l1 | 0.305600 | 0.250000 | 0.219200 | 0.677200 |
| logisticregression_l2 | 0.277800 | 0.277800 | 0.213200 | 0.675700 |
| logisticregression_elasticnet | 0.250000 | 0.215300 | 0.173600 | 0.703400 |
| sgd | 0.222200 | 0.194400 | 0.135400 | 0.582000 |
| mlp | 0.333300 | 0.298600 | 0.183300 | 0.651500 |
| decisiontree | 0.250000 | 0.180600 | 0.152800 | 0.534000 |
| randomforest | 0.277800 | 0.250000 | 0.163200 | 0.674900 |
| extratree | 0.305600 | 0.284700 | 0.181900 | 0.591800 |
| adaboost | 0.222200 | 0.194400 | 0.140300 | 0.737700 |
| extratrees | 0.305600 | 0.243100 | 0.181900 | 0.741300 |
| gaussianprocess | 0.083300 | 0.125000 | 0.010400 | 0.500000 |
| gradientboosting | 0.250000 | 0.236100 | 0.148600 | 0.649300 |
| svc | 0.250000 | 0.125000 | 0.031200 | 0.364700 |
| Model | Accuracy | Recall weighted | Precision weighted | AUC |
|---|---|---|---|---|
| logisticregression | 0.416700 | 0.354200 | 0.354200 | 0.700400 |
| logisticregression_l1 | 0.444400 | 0.375000 | 0.361100 | 0.685400 |
| logisticregression_l2 | 0.388900 | 0.319400 | 0.267400 | 0.699700 |
| logisticregression_elasticnet | 0.444400 | 0.375000 | 0.361100 | 0.691400 |
| sgd_l1 | 0.444400 | 0.340300 | 0.229700 | 0.760700 |
| mlp | 0.416700 | 0.395800 | 0.276400 | 0.678200 |
| decisiontree | 0.444400 | 0.409700 | 0.315300 | 0.671300 |
| randomforest | 0.472200 | 0.395800 | 0.313900 | 0.710800 |
| extratree | 0.444400 | 0.409700 | 0.291700 | 0.723300 |
| adaboost | 0.472200 | 0.444400 | 0.308300 | 0.777200 |
| extratrees | 0.555600 | 0.513900 | 0.465800 | 0.744000 |
| gaussianprocess | 0.083300 | 0.125000 | 0.010400 | 0.500000 |
| gradientboosting | 0.388900 | 0.312500 | 0.152200 | 0.626500 |
| svc | 0.444400 | 0.333300 | 0.236100 | 0.633700 |
| Model | Accuracy | Recall weighted | Precision weighted | AUC |
|---|---|---|---|---|
| logisticregression | 0.310300 | 0.310300 | 0.316300 | 0.721700 |
| logisticregression_l1 | 0.275900 | 0.275900 | 0.422400 | 0.734400 |
| logisticregression_l2 | 0.344800 | 0.344800 | 0.405400 | 0.720700 |
| logisticregression_elasticnet | 0.344800 | 0.344800 | 0.405400 | 0.737800 |
| sgd_l1 | 0.206900 | 0.206900 | 0.115000 | 0.628700 |
| mlp | 0.137900 | 0.137900 | 0.133600 | 0.627800 |
| decisiontree | 0.206900 | 0.206900 | 0.181000 | 0.486100 |
| randomforest | 0.310300 | 0.310300 | 0.250000 | 0.715900 |
| extratree | 0.172400 | 0.172400 | 0.232200 | 0.566100 |
| adaboost | 0.172400 | 0.172400 | 0.137900 | 0.538100 |
| extratrees | 0.172400 | 0.172400 | 0.187700 | 0.538900 |
| gaussianprocess | 0.069000 | 0.069000 | 0.004800 | 0.500000 |
| gradientboosting | 0.275900 | 0.275900 | 0.171300 | 0.605500 |
| svc | 0.206900 | 0.206900 | 0.298300 | 0.565500 |
Overall Test Performance Report
| Metric | Train | Test |
|---|---|---|
| Accuracy | 0.929 | 0.345 |
| Recall | 0.929 | 0.345 |
| Precision | 0.931 | 0.405 |
| AUC | 0.995 | 0.721 |
Test Set Classification Report
| Grade | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| AA | 1.00 | 0.33 | 0.50 | 3 |
| BA | 1.00 | 0.67 | 0.80 | 3 |
| BB | 0.67 | 0.67 | 0.67 | 3 |
| CB | 0.00 | 0.00 | 0.00 | 2 |
| CC | 0.27 | 0.75 | 0.40 | 4 |
| DC | 0.33 | 0.20 | 0.25 | 5 |
| DD | 0.14 | 0.14 | 0.14 | 7 |
| Fail | 0.00 | 0.00 | 0.00 | 2 |
| accuracy | | | 0.34 | 29 |
| macro avg | 0.43 | 0.34 | 0.34 | 29 |
| weighted avg | 0.41 | 0.34 | 0.34 | 29 |
Overall Test Performance Report
| Metric | Train | Test |
|---|---|---|
| Accuracy | 1.000 | 0.310 |
| Recall | 1.000 | 0.310 |
| Precision | 1.000 | 0.250 |
| AUC | 1.000 | 0.716 |
Test Set Classification Report
| Grade | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| AA | 1.00 | 0.67 | 0.80 | 3 |
| BA | 0.00 | 0.00 | 0.00 | 3 |
| BB | 0.00 | 0.00 | 0.00 | 3 |
| CB | 0.00 | 0.00 | 0.00 | 2 |
| CC | 0.17 | 0.25 | 0.20 | 4 |
| DC | 0.25 | 0.20 | 0.22 | 5 |
| DD | 0.33 | 0.71 | 0.45 | 7 |
| Fail | 0.00 | 0.00 | 0.00 | 2 |
| accuracy | | | 0.31 | 29 |
| macro avg | 0.22 | 0.23 | 0.21 | 29 |
| weighted avg | 0.25 | 0.31 | 0.26 | 29 |
In conclusion, the evaluation of the various models reveals their performance on the classification task, and the results demonstrate the impact of feature selection and hyperparameter optimization. The best-performing model, Logistic Regression with l1 penalization, shows promising results in terms of accuracy, recall, precision, and AUC compared to the other classifiers.
Nonetheless, the performance of this model is still poor, in part because the tuning process fits a single multi-class model shared across all grades. Developing and fine-tuning a separate model per grade could achieve better classification performance.
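One standard way to realise "one model per grade" is a one-vs-rest scheme, where a separate binary classifier learns to distinguish each grade from all the others. The sketch below uses scikit-learn's `OneVsRestClassifier` with synthetic stand-in data; the l1-penalized base estimator echoes the best model above, but the exact setup is an assumption, not the author's implementation:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for the survey features and grade labels.
X, y = make_classification(
    n_samples=145, n_classes=4, n_informative=8, random_state=6064,
)

# One binary classifier per grade: each fitted estimator separates
# a single grade from the rest, and could be tuned independently.
per_grade = OneVsRestClassifier(
    LogisticRegression(penalty="l1", solver="saga", max_iter=7500),
    n_jobs=-1,
)
per_grade.fit(X, y)
print(len(per_grade.estimators_))  # one fitted model per grade
```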